ABSTRACT



This paper is divided into two distinct parts. The first is a 
summary of the general theory of information retrieval. A comprehensive
mathematical model is described in terms of the theory of Boolean
lattices, which serves to unify and make precise the basic problem of 
information retrieval. All possible basic methods of coding informa- 
tion for storage and retrieval are briefly described and contrasted. 
Another mathematical model for information retrieval based on linear 
graphs and stochastic processes is briefly described as an alternative 
to the lattice model. The appendices contain a survey of lattice 
theory, and an example of superimposed coding. 

The second part of this paper is a detailed example of the appli- 
cation of information retrieval techniques utilizing the facilities of 
the USNPGS Computer Center to handle a problem involving the technical 
reports section of the school library. 

The writer wishes to express his appreciation for the assistance 
and encouragement given him by Professors Elmo J. Stewart, Willard E. 
Bleick, and Edward Ward of the U. S. Naval Postgraduate School in this 
investigation.

INTRODUCTION

1. Definition and Scope

Part One of this paper will deal chiefly with the theory of 
information retrieval in general. However, the discussion will be
restricted to the extent that those problems will be emphasized for
whose solution large commercially available electronic computers
are readily adaptable. We define information retrieval as an opera-
tion performed on a stored file of individual items containing coded
or classified descriptions of their referents. The operation
consists of selecting those items which satisfy a given set of
search criteria and then presenting the individual references to
the searcher. In electronic computer systems, for instance, the
file might consist of magnetic tape on which are stored coded in- 
dividual items. The retrieval process then would consist of a 
sequence of operations under computer control designed to select 
automatically every item which meets all search criteria. As an 
adjunct to the search process, the reference portion of the re- 
trieved item might be machine edited and printed in some convenient 
form. The overall system might further undertake to automatically 
process the file itself by deleting and/or correcting old material 
and adding new. Recent experiences have shown that the use of 
electronic computers can provide fast, accurate, convenient, and 
inexpensive retrieval. [1]

There is a very wide range of information retrieval problems,
however, and not all of them are suitable for handling by electronic
computers. Three determining factors may be said to be: the size
of the information file, the frequency with which questions are
posed for retrieval, and the complexity of the criteria by means
of which desired information is selected. It so happens that, just
as the increasing magnitudes of these three factors point toward
the utilization of a large digital computer, so also do they point
out the need for an overall, unified theory from which the problem
of information retrieval may be attacked. [2]

2. Criteria and Approach

Before proceeding to discuss possible approaches to such a 
theory, we would first like to consider what one should expect 
from a satisfactory information retrieval system. The person who
will use an automated information retrieval system by asking
questions and receiving reports will be primarily interested in 
accuracy and speed. But he would also prefer a system which will 
make few demands on him as far as learning new techniques is con- 
cerned, and he would somehow like to reserve the privilege of 
browsing through the file if it were possible, since one function 
of a collection of documents is to stimulate new and unexpected 
approaches to his problem. 

From the librarian's or documentalist's viewpoint, the daily
routine activities necessary to operate the system must be performed
with the utmost convenience. Therefore, the manner in which the
searching and file maintenance entries are prepared must be as
simple and direct as possible.

A customer desiring to use an information retrieval system
actuates it by presenting a "prescription" for the information that
he wants. The retrieval system responds to this prescription by
indicating to the customer a set of documents from the collection
which presumably will furnish the information he desires. In other
words, an information retrieval system translates or transforms
the customer's prescription into a set of documents. [3]
From this operational point of view we shall begin, in the next
chapter, the construction of a mathematical model for the infor-
mation retrieval problem.

II

A LATTICE MODEL FOR DOCUMENTS*

1. Sets of Documents

In the following model, we shall consider a library to consist
essentially of a collection of n documents (books, pamphlets, period-
icals, etc.) which we shall call U. Every "batch" of these, selected
by any means from among the whole collection, will be called a set
of documents. An example of such a set would be "all documents
(in the library) written by J. G. Jones." Another example would
be "all documents (in the library) bound in red vellum." We say
that two such sets are identical if they contain precisely the same
documents. Clearly any possible such set can be defined by enumera-
ting its contents, i.e., by submitting a list of documents by their
"call numbers" [4] (or even by title if there are no duplications
of titles in the whole collection).

*In this chapter and the next, references will be made to
definitions and theorems in Appendix A.

Now consider all possible such sets of documents taken from
the library. The aggregate of all such sets is a new collection,
a set of sets, which we shall call L. L has $2^n$ distinct members, in-
cluding the null set 0, containing no documents; the set of all
documents U; n sets each containing one document; and in general
$\binom{n}{m}$ distinct sets consisting of precisely m documents (for $m \le n$).
Thus for each choice of documents that could be taken from the library,
there exists a member of L. A moment's reflection will show that
most of the members in L represent heterogeneous sets of documents
with no unifying similarities in their subject content; such sets
are not a useful output to any input retrieval request. This fact
constitutes a central problem in information retrieval.
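
The count of members of L can be checked by direct enumeration for a very small library. The following is only a sketch: the four call numbers are invented for illustration, and the names `all_sets` and `sizes` are ours, not part of the model.

```python
from itertools import combinations
from math import comb

# Hypothetical four-document library U, identified by made-up call numbers.
U = ["QA76.1", "QA76.2", "Z699.1", "Z699.2"]
n = len(U)

def all_sets(documents):
    """Return every member of L: all subsets of the collection, by size."""
    members = []
    for m in range(len(documents) + 1):
        members.extend(frozenset(c) for c in combinations(documents, m))
    return members

L = all_sets(U)
# Number of members of L containing exactly m documents, for m = 0..n.
sizes = [sum(1 for s in L if len(s) == m) for m in range(n + 1)]

print(len(L))
print(sizes)
```

Even for this tiny collection most members of L are arbitrary mixtures of documents, which is the difficulty the text points to.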

2. Requests for Information 

A request for information from a library will be viewed in
this model as a "prescription" consisting of logical constraints
which describe a desired set of documents. However, the set need
not actually exist in any library. We shall have occasion to refer
to the collection of all possible such requests, denoted by R,
and we shall refer to an individual request as a member of R.

The purpose of any retrieval system is to receive as an input
a request or query which can be represented by a member in R, and
to convert this input to an output which consists of a citation to,
or a set of copies from, some set of documents which is in turn
represented by a member in L. Any retrieval system defines a trans-
formation or mapping T, which takes each member r in R onto some
member l in L. In order to characterize such mappings, and to select
the most useful one in a particular application, it is first neces-
sary to study the structure of the aggregates R and L, both as
algebraic systems and as topological spaces. We have seen how the
individual documents in a library may be thought of as distinct
elements, for they are uniquely distinguished at least by their
respective call numbers. But they are not ordered in any sense
except arbitrarily by, say, location. Any set of documents from a
library, i.e., any set in L, is definable, in general at least, by
enumeration. Two such sets are distinct if one contains at least
one document (element) not contained in the other. They may be
partially ordered (DEF.II) by inclusion, but they are not, in gen-
eral, linearly ordered (DEF.III) since some at least are disjoint.

Any subset (DEF.I) of members of L has a greatest lower bound
(DEF.VI), namely the intersection of its member sets, which is
also a member of L. Every subset of members from L has a unique
least upper bound (DEF.V), namely the union of its members, which
is also a set in L. Therefore L is a finite lattice [5] (DEFS.VII).
In fact L is a Boolean (DEF.XIII) algebra under the usual set-
theoretic operations of union, a+b, intersection, ab, and comple-
mentation, a'.
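
As an illustration only, the lattice operations on L can be written out for a three-document library. The document names and the helper names `glb`, `lub`, and `complement` are assumptions made for this sketch; the checks cover the complement laws and one De Morgan law, not every axiom of a Boolean algebra.

```python
from itertools import combinations

U = frozenset({"d1", "d2", "d3"})          # hypothetical three-document library
L = [frozenset(c) for m in range(len(U) + 1)
     for c in combinations(sorted(U), m)]

def glb(sets):
    """Greatest lower bound in L: the intersection of the given sets."""
    result = U
    for s in sets:
        result &= s
    return result

def lub(sets):
    """Least upper bound in L: the union of the given sets."""
    result = frozenset()
    for s in sets:
        result |= s
    return result

def complement(a):
    """Complement of a relative to the whole collection U."""
    return U - a

a, b = frozenset({"d1"}), frozenset({"d2"})
# a and b are disjoint, so neither contains the other: the order is partial.
comparable = a <= b or b <= a

# Complement laws and one De Morgan law, checked for every pair in L.
boolean_ok = all(
    (x | complement(x)) == U
    and (x & complement(x)) == frozenset()
    and complement(x | y) == complement(x) & complement(y)
    for x in L for y in L
)
```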

3. Classification Systems 

From the sets of L definable by enumeration, any system of 
classification serves simply to select certain distinguished sub- 
sets of L. The various methods of classification for cataloging 
documents have but one characteristic in common, i.e., they are 
exhaustive in the sense that each document lies in at least one 
of the distinguished subsets of L. 

Now let us examine the structure of the subsets distinguished 
under various methods of classification of documents. 



Source



We begin with the simplest case of classification by source,
i.e., by publisher or by author. In the latter case, each subset
is distinguished by the name of an author, and consists of all the
books in the library which were written by the designated author.
These distinguished subsets have no ordering except an arbitrary
one such as alphabetic. They are all disjoint, provided we make
the convention that a joint authorship subset does not include the
subsets of its individual authors, but only contains those docu-
ments written by the group jointly. If we denote the set of subsets
distinguished as to source by A, then A*, its closure under union,
intersection, and complementation, is indeed a Boolean lattice and
a sublattice of L (DEF.IX).

Date 

Now consider classification by date of publication. This
could be viewed as the same as that by source, that is, in the
sense "all documents published on said date." However, it is more
interesting and useful to think of this as a classification in terms
of "since said date" or "before said date." Either system distin-
guishes subsets of L which form an (ascending or descending) chain
(DEF. XV) in L ($D_s$, since date; or $D_b$, before date), which is a
sublattice of L. Either $D_s$ or $D_b$ alone is strongly distributive
(DEF. XV) but not Boolean. It requires the union of both,
$D_s + D_b = D$, to contain all complements (DEF. XII) and to form a
Boolean algebra.
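
A small numerical sketch may make the chain structure concrete. The publication years below are invented, and we write `D_s` and `D_b` for the families of "since said date" and "before said date" sets; the check on `D` covers complements only, not full closure under union and intersection.

```python
# Hypothetical publication years for a five-document library.
year = {"d1": 1950, "d2": 1953, "d3": 1953, "d4": 1957, "d5": 1960}
U = frozenset(year)
dates = sorted(set(year.values()))

# D_s: "since said date" sets; D_b: "before said date" sets.
D_s = [frozenset(d for d in U if year[d] >= t) for t in dates]
D_b = [frozenset(d for d in U if year[d] < t) for t in dates]

def is_chain(sets):
    """True if every pair of sets is comparable under inclusion."""
    return all(x <= y or y <= x for x in sets for y in sets)

def closed_under_union_and_intersection(sets):
    """True if unions and intersections of members are again members."""
    members = set(sets)
    return all((x | y) in members and (x & y) in members
               for x in sets for y in sets)

def has_all_complements(sets):
    """True if the complement of every member is again a member."""
    members = set(sets)
    return all((U - x) in members for x in sets)

D = D_s + D_b   # both families taken together
```

Each family alone is a chain closed under union and intersection, yet neither contains the complements of its own members; those appear only when both families are taken together.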
Uniterms



The system of classification known as coordinate indexing
selects distinguished subsets of L by the use of descriptors or
uniterms, which are words, word roots, or short phrases which des-
cribe, in part, a document's contents. In theory, any number of
uniterms may be assigned to a particular document, and the list
from which they are taken may be either permanently fixed or
open-ended. Let G be the set of all subsets each of which is dis-
tinguished by a single descriptor. Then G is partially ordered
by set inclusion. But G is not a lattice, since the union of those
documents relating to one term, $g_1$, with those documents relating
to another, $g_2$, i.e., $g_1 + g_2$, does not always distinguish a set,
$g_3$, already selected by some other single descriptor. However, if
we take the closure (DEF. XIII) of G, G*, then we obtain a Boolean
sublattice of L which contains all sets distinguished by any
logical combination of descriptors. One school of thought on co-
ordinate indexing [6, 7] considers it highly desirable that
no set of documents distinguished by a particular descriptor be
identical to or wholly contained in another set distinguished by
a different descriptor. If this condition holds, then all the
irreducible elements (DEF. XVII) of G are precisely all the docu-
ments and each is representable by an irredundant intersection
(DEF. XVIII) of members of G. In fact, every member of G is a
point (DEF. XIX) in G*, and conversely, every point in G* is a
member of G.
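
The passage from G to its closure can be sketched by brute force for a toy collection. The three descriptors and their postings below are invented, and `boolean_closure` is our own helper that simply iterates union, intersection, and complementation until nothing new appears; it is practical only for very small examples.

```python
from itertools import combinations

# Hypothetical descriptor assignments for a four-document library.
postings = {
    "lattice":   frozenset({"d1", "d2"}),
    "retrieval": frozenset({"d2", "d3"}),
    "coding":    frozenset({"d3", "d4"}),
}
U = frozenset().union(*postings.values())
G = set(postings.values())

def boolean_closure(generators, universe):
    """Close a family of sets under union, intersection, and complement."""
    closed = set(generators) | {universe, frozenset()}
    while True:
        new = {universe - x for x in closed}
        new |= {x | y for x, y in combinations(closed, 2)}
        new |= {x & y for x, y in combinations(closed, 2)}
        if new <= closed:
            return closed
        closed |= new

G_star = boolean_closure(G, U)
# The union of two descriptor sets is not itself a descriptor set here.
union_12 = postings["lattice"] | postings["retrieval"]
```

In this example no descriptor set is contained in another, and every single document can be isolated by some logical combination of descriptors.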
Subject Heading

The most commonly used and most complex method of classifica-
tion employs a hierarchy of subject headings. There are two
basic kinds [3]. In a strictly hierarchical system no cross
references are permitted; the linear graph of such a system is a
tree such as that shown in figure 2.1 (ignoring the dotted lines).
The sets of L distinguished by a strictly hierarchical classifica-
tion form by themselves a lattice denoted by H, which is, however,
not a sublattice of L. H is itself closed under intersection and
union and is strongly distributive. But for h in H, h' = I - h
is not in general a member of H. Hence H is not Boolean. The
structure of H is unsatisfactory for most retrieval purposes for
another reason. Consider sets a through r in H (refer to figure 2.1)
and note the inclusions shown there; in particular, jk = 0. Then the
least upper bound of j and d in H is n = j + k + e, which contains
j + d. On the other hand, their l.u.b. in L is j + d = l, some member
of L not in H. Normally, if we request the
logical disjunction of the sets distinguished by the two concepts
d, j, that is, j or d, we mean the set l rather than n, for we do
not wish to include e. This difficulty arises from H not being a
sublattice of L. Thus it is necessary to consider the closure of
H, namely H*, which is a (Boolean) sublattice of L.
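
The difference between the two least upper bounds can be illustrated on a small tree. The hierarchy below is hypothetical: it reuses the letters j, k, d, e, and n from the text, but the tree shape and the document numbers are assumptions, since the figure itself is not reproduced here.

```python
# Hypothetical strict hierarchy (a tree). Each heading is given as the set
# of documents filed under it or under any of its subheadings.
H = {
    "I": frozenset({1, 2, 3, 4, 5, 6}),
    "n": frozenset({1, 2, 3, 4, 5}),
    "j": frozenset({1, 2}),
    "k": frozenset({3, 4}),
    "d": frozenset({3}),
    "e": frozenset({5}),
    "O": frozenset(),
}

def lub_in_H(x, y):
    """Smallest distinguished set of H containing both x and y."""
    candidates = [s for s in H.values() if x <= s and y <= s]
    return min(candidates, key=len)

def lub_in_L(x, y):
    """Least upper bound in L is simply the set union."""
    return x | y

within_H = lub_in_H(H["j"], H["d"])   # the heading n
within_L = lub_in_L(H["j"], H["d"])   # the set l = j + d
```

A request for "j or d" answered within H would return everything under n, including the unwanted heading e; answered within L it returns only the documents actually asked for.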

The other type of hierarchical structure, which permits cross
references, is called weakly hierarchical. Such a system is shown
in figure 5.2 or in figure 2.1 if we include the dotted lines.
Consider the subset $H_w$ of sets of L distinguished by a weakly
hierarchical classification. Let H be the subset of sets of L dis-
tinguished by the same classification exclusive of its cross refer-
ences. H is contained in $H_w$ and the closure of either is the same,
i.e., $H_w^* = H^*$. What, then, is the lattice-theoretic distinction
between them? From a graphical point of view it may be seen that
there may be in $H_w$ more than one path between an individual docu-
ment and the vertex I. Any such path is a chain of sets, but since
$H_w$ is a modular (DEF.X) (in fact strongly distributive) lattice,
any two such chains have equivalent refinements (DEF.XIV and
Theorem IV). One of the chains in $H_w$ between a given individual
document and I will always be the unique chain which exists in H;
hence any chain in $H_w$ (i.e., any path to a document through $H_w$)
has a refinement equivalent to the unique chain in H. Only if we
conduct a stepwise search does the existence of a chain of possibly
lower dimension in $H_w$ gain us any advantage. If we require the
flexibility to handle searches for all logical functions of the
various distinguished sets we must still go to the closure of the
lattice, which is H* in either case.

This conclusion is not very satisfying since we feel intuitively
that cross references should be of considerably more use. The maze
model discussed in a later chapter attempts to utilize cross
references more effectively.

As far as the lattice model is concerned, however, it has been 
shown that any standard classification system which distinguishes 
certain sets from among all possible sets of documents, generates 
as its closure a Boolean lattice.

4. The Catalog Card

Now we will look at the catalog card or other clerical record 
which represents an individual document. We shall take the view- 
point that such a record is essentially a logical function which 
relates subsets of L distinguished by the same or different methods 
of classification. This is best explained by an example. 

Figure 2.2 

Typical Library Catalog Card 



SEMI-CONDUCTORS 

E8151 D3 

DeFrance, Joseph J. 

Electron tubes and semiconductors. 
Englewood Cliffs N.J., Prentice-Hall, 

1958. 

288p. illus. 9x6in.

ELECTRON TUBES 



In figure 2.2 there is an example of a library catalog card.
This card classifies the book it represents as to subject (two
headings), author, publisher, date and place of publication, number
of pages, size, and illustrations. The implication is that this
document is simultaneously a member of a particular set distin-
guished by each of the several methods of classification [4].
For instance, it is both on the subject of semi-conductors and
authored by J. J. DeFrance. We have here the intersection of sets
distinguished by different methods of classification, for instance,
the product ha where h is in H and a is in A. We are required to
consider the set of all possible such products, and hence the
cardinal product of H and A. In the practical case, it is unneces-
sary to consider any other logical operation among members of dif-
ferent classification systems. For instance, a catalog card would
not normally specify that the volume in question was either pub-
lished by Prentice-Hall, or else had 288 pages, or both. This is
reflected by the model in that only the cardinal product (DEF.XX)
of two lattices is again a lattice (Theorem IX). Neither the
cardinal sum nor any of the ordinal operations preserve all the
structure we require.

If we consider the Boolean closure of the cardinal product
of two or more classification systems, it is identical with the
cardinal product of the Boolean closures of the several systems,
e.g., (HA)* = H*A* (Theorem IX).

We can now define the Boolean lattice B as the cardinal prod-
uct of the several Boolean lattices generated (as described above)
by the various classification systems used in the composition of the
library catalog card. In B, every document is a point (DEF.XIX).
However, a catalog card is, in general, only a redundant representa-
tion of the document it describes. For instance, in one example, it
may happen that the author, DeFrance, has never had a book published
by any other company than this one, i.e., Prentice-Hall. Then the
set distinguished by all books published by Prentice-Hall may be
eliminated from the representation of this document without permitting
that representation (as the intersection of subsets) to include any
other documents. In this regard, classification by pagination and
place of publication, for instance, are almost always redundant.
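
The reading of a catalog card as an intersection, and the notion of a redundant entry, can be sketched as follows. The sets below are invented ("book" stands for the DeFrance volume and x1, x2, x3 for other holdings), and `represented` and `redundant_entries` are hypothetical helper names.

```python
# Hypothetical distinguished sets from three classification systems.
sets = {
    ("subject", "SEMI-CONDUCTORS"): frozenset({"book", "x1"}),
    ("author", "DeFrance"):         frozenset({"book", "x2"}),
    ("publisher", "Prentice-Hall"): frozenset({"book", "x1", "x2", "x3"}),
}

def represented(card):
    """Intersection of the sets named on a catalog card."""
    result = None
    for entry in card:
        result = sets[entry] if result is None else result & sets[entry]
    return result

def redundant_entries(card):
    """Entries that can be dropped without admitting other documents."""
    full = represented(card)
    return [e for e in card
            if len(card) > 1
            and represented([c for c in card if c != e]) == full]

card = list(sets)
```

Because every DeFrance title in this toy library is a Prentice-Hall title, the publisher entry adds nothing to the intersection, while dropping the subject or the author would admit a second document.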

5. Summary

To summarize thus far: each document in a library may be
represented as a point in a Boolean lattice which is the direct
product of the closures of the sets of subsets of documents dis-
tinguished by the various systems of classification employed. In
the following section we turn our attention to the space of requests
for information from the library and define a mapping (the retrieval
function) between it and the space of documents.

III

A LATTICE MODEL FOR RETRIEVAL

1. The Idealized Request

We have previously referred to the user's request for informa-
tion as a prescription. Of course it is not always presented in
that way. It may be vague and even self-contradictory. This gives
rise to a basic and very difficult task which is also central to
the automatic translation of languages, that is: by what rules
may we so formalize all human communications as to make their
meanings always unmistakable? [2]

This problem is beyond the scope of this paper and it will
be bypassed by the following assumption. We assume that any raw
query can be converted or transformed by unspecified operations
into an "ideal request" which obeys the rules of formal logic and
has a prescribed and predetermined format. We shall take the
approach that such an idealized request may be represented by a
logical combination of constraints on possible classification
systems. [8] We shall call these constraints admissible if they
embody logical operations conforming to the inherent structure of
the particular classification system to which they relate.
Otherwise they will be called inadmissible. Several examples
follow based upon classification systems already described in
Section 2.

In the case of classification by source, the most elementary
constraint would take the verbal format "All documents from
source a," or symbolically simply a. Then admissible operators with-
in the classification by source are disjunction and complementation
but not conjunction. (See tables in figures 3.1 or 3.2.) Recalling
the convention adopted for documents having multiple authors, we
note that a document cannot normally be from two or more sources
simultaneously.

In the classification system using descriptors (set G), all
logical operations are admissible, but we shall adopt the convention
(see Chapter 2, Section 3) that no descriptor will be admissible if
its presence always implies the presence of another (admissible)
descriptor.

In the case of hierarchical classification even this restric-
tion must be dropped and all possible logical operations give rise
to admissible constraints.

For a summary of the admissible and inadmissible constraints
for several classifications see figures 3.1 and 3.2.

For somewhat the same reasons discussed in the case of the
document record or catalog card in Section 2, we shall restrict
logical operations between the various classification systems to
that of conjunction (both-and) only. This is not a significant
restriction in practice because a raw request involving disjunction
(either-or) between constraints on different classification systems
may be broken into two or more separate ideal requests.

[Figures 3.1 and 3.2: admissible and inadmissible constraints, including classification by hierarchical subject heading]
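
The splitting of a raw request into ideal requests can be sketched as a small recursive expansion. The tuple representation and the name `ideal_requests` are assumptions made for illustration; the sketch also assumes that no classification system is itself named "and" or "or".

```python
from itertools import product

def ideal_requests(expr):
    """Expand a raw request into ideal requests (conjunctions only).

    expr is a leaf (system, constraint) or a tuple ("and" | "or", left, right).
    Each ideal request is returned as a list of leaves to be and-ed together.
    """
    if expr[0] == "or":
        # Either-or between systems: answer each alternative separately.
        return ideal_requests(expr[1]) + ideal_requests(expr[2])
    if expr[0] == "and":
        # Both-and: pair every alternative on the left with every one on the right.
        return [left + right
                for left, right in product(ideal_requests(expr[1]),
                                           ideal_requests(expr[2]))]
    return [[expr]]

# Hypothetical raw request: semiconductors, and (by DeFrance or since 1955).
raw = ("and",
       ("subject", "semiconductors"),
       ("or", ("author", "DeFrance"), ("date", "since 1955")))
requests = ideal_requests(raw)
```

Here one raw request becomes two ideal requests, each a pure conjunction across classification systems; the searcher receives the union of the two answers.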
2. The Space of Requests

Let every possible ideal request be fulfilled conceptually by
at most a denumerable number of possible documents. Note that an
impossible document in this sense would only be one which incorpo-
rated a logical contradiction, but we have eliminated the description
of such "impossible" documents by permitting only admissible con-
straints in our ideal requests.

First, we consider the set of all possible documents, U. Note
that U is infinite but denumerable. This may be viewed as a hypothe-
tical library, which would yield an infinite number of appropriate
documents in response to every conceivable request for information
which we might make. Again, as in the case of the actual library,
we consider the collection of all possible sets of members of U.
Conceptually, at least, the members of U may be classified as to author,
date, subject, etc., resulting in the denumerably infinite partially
ordered sets and lattices A, D, G, H, ... and A*, D*, G*, H*, ... As in the
finite case, each of these is generated by the members of L distin-
guished by a particular system of classification. In this space of
possible documents, the idealized request is somewhat analogous to
the catalog card. These requests may be considered as generating a
lattice similar in most ways to B, described in Chapter II. However,
there are significant differences.

Unlike the catalog cards in the space of actual documents, the
requests are not necessarily points in R. Indeed, one request may
contain many others, and the same possible document may fulfill all
the requests in a chain. The space R is a lattice, the cardinal
product of several lattices, not all of which are Boolean (because
of the restriction to admissible constraints). R has a denumerable
infinity of elements, the possible documents. It is required to
map R onto the space B of sets of actual documents, which is a
cardinal product of Boolean lattices with a finite number of elements.

3. The Retrieval Homomorphism

The mapping is a homomorphism (DEF.XXI), $T: R \to B$. In what
follows, components (DEF.XX) of R will be denoted by underlining.

If we exclude the inadmissible constraints as logical operations
we must define T differently for the various components in the direct
products $\underline{A}\,\underline{D}\,\underline{G}\,\underline{H} \cdots = R$ and $A\,D\,G\,H \cdots = B$. For instance,
$T: \underline{A} \to A$ is a join-homomorphism only and $T: \underline{D} \to D$ is a
homomorphism of the two spaces as semi-groups only.

On the other hand, $T: \underline{H} \to H$ is a lattice-homomorphism between
their respective closures. In this case we may look at $\underline{H}$ as includ-
ing H, since all actual subject headings are included among all
possible ones. Then a particularly simple characterization of T is
as the equivalence relation on $\underline{H}$ which generates H as its convex
sublattice of ideals (DEF.XXII and Theorem X). This characterization
of T can be extended to all components of the direct products if the
inadmissible constraints are re-admitted using as their operational
definitions the expressions appearing in column three of figure 3.2.
Now it is possible to describe B as the direct product of convex
sublattices of R generated by the homomorphism T taken as an equiva-
lence relation on R (Theorem XI). Each request in R defines a
direct product of ideals in B, one from each component lattice.

In terms of this structure, it is possible to describe most
of the verbal statements concerning the properties of classification
and retrieval systems which are necessary or in some sense desirable.
For example, those requests in R which cannot be filled are the analog
of zero in B. Also, redundancy among several classification
systems may be expressed by stating that the center (DEF.XXIII) of B
is not empty, for in an irredundant system of classifications the
maximal distributive sublattices of B are in fact the several
components of the direct product (Theorem XII).
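
A toy version of the retrieval mapping within a single descriptor component may help fix ideas. This is a sketch under strong assumptions: requests are nested tuples over invented descriptors, all three logical operations are treated as admissible, and T preserves them by construction rather than by proof.

```python
# Hypothetical postings: descriptor -> set of actual documents.
postings = {
    "lattice":   frozenset({"d1", "d2"}),
    "retrieval": frozenset({"d2", "d3"}),
}
U = frozenset({"d1", "d2", "d3", "d4"})

def T(request):
    """Map a request to the set of actual documents that satisfies it.

    A request is a descriptor name or a tuple:
    ("and", r1, r2), ("or", r1, r2) or ("not", r1).
    """
    if isinstance(request, str):
        # A descriptor with no actual documents maps to the empty set.
        return postings.get(request, frozenset())
    op = request[0]
    if op == "and":
        return T(request[1]) & T(request[2])
    if op == "or":
        return T(request[1]) | T(request[2])
    if op == "not":
        return U - T(request[1])
    raise ValueError(f"unknown operator: {op}")

r1, r2 = "lattice", "retrieval"
```

A request on a descriptor that no actual document carries, such as the made-up "maze", is a legitimate possible request that cannot be filled; it is sent to the empty set.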

4. Summary

In summary, we have seen that the retrieval function may be
defined as a lattice homomorphism of the set of all possible ideal
requests onto the set of subsets of documents, $T: R \to B$. More-
over, each request defines a direct product of ideals in B, one from
each classification system under which the documents are catalogued.

IV



CODING FOR INFORMATION RETRIEVAL 

1. Definitions

Every material aid to the storage and retrieval of information,
from the catalog card to the digital computer, requires that the infor-
mation it handles be put in some particular format or language. [9]
And as the assisting apparatus grows more complex, its natural
language seems to depart the further from that of the humans it
serves. Therefore, a part of the problem of information processing
is that of translation from one language to another in the most effec-
tive and economical manner. This is what we will call "coding". [10]

We shall first describe three basic types of codes which have been
used in information retrieval systems. Discussion of their advantages
and limitations will be put, as much as possible, on a mathematical
basis. Secondly, we shall attempt to gain some insights into
the problem of coding from information theoretic considerations.

For simplicity, and because the vast majority of data processing
devices use binary arithmetic, we will discuss coding in terms of
binary digits or "bits".

The coded information may be thought of as being represented
by charged or non-charged spots on magnetic drums or tape, holes in
cards, paper tape, or for that matter, notches in a post.

2. Direct Coding 

A direct code, by definition, uses a single bit (or notch) to
represent a single idea. In this system, if we have H different ideas
to code, we will require H bits in each record. Or, to state the
matter in a different way, if our record is planned to contain H
bits (each bit either 1 or 0), direct coding will force us to
analyze our subject matter in terms of not more than H concepts.
Direct coding of itself cannot cause the searching operation to
produce extra records. Records selected by testing for 1 in a
given bit position have been coded for a given concept - no more
and no less. Within the limits imposed by the restricted number
of concepts in terms of which subject matter can be analyzed when
using direct coding, it affords the simplest and most convenient
method for conducting direct search operations. A single run
through by a machine of limited memory is all that is needed.
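
A direct code can be sketched with one bit per concept in an integer record. The concept list, record names, and the helpers `encode` and `search` are invented for illustration.

```python
# Hypothetical concept list; with H concepts each record needs H bits.
CONCEPTS = ["lattices", "coding", "magnetic tape", "libraries"]
BIT = {concept: 1 << i for i, concept in enumerate(CONCEPTS)}

def encode(concepts):
    """Direct code: set one bit for each concept present in the record."""
    record = 0
    for c in concepts:
        record |= BIT[c]
    return record

def search(records, concept):
    """Single pass over the file, testing one bit position per record."""
    return [name for name, bits in records.items() if bits & BIT[concept]]

records = {
    "report-1": encode(["lattices", "libraries"]),
    "report-2": encode(["coding"]),
    "report-3": encode(["coding", "magnetic tape"]),
}
```

Unlike the codes that follow, a direct-coded record can carry several concepts at once, at the price of one bit position for every concept in the vocabulary.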

3. Selector Coding

Direct coding, as we have seen, attaches meaning to a single
bit position. It is also possible to attribute significance to a
combination of bit positions. When this is done in such a way as
to minimize the amount of searching in order to isolate records
characterized by some one combination of bits in a particular field,
the resulting scheme is termed a "selector code".

If we let C denote the number of combinations of H things taken
Y at a time, then

$$C = \frac{H!}{Y!\,(H-Y)!} \tag{4.1}$$

For example, if we attach meaning to a combination of two bits
in a field of size five, then we can indicate any one of ten concepts
in that field, as we find by substituting in equation (4.1):

$$C = \frac{5!}{2!\,3!} = \frac{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5}{(1 \cdot 2)(1 \cdot 2 \cdot 3)} = 10$$

If the value of Y in equation (4.1) were unity, then the code
would revert to the direct type. From a mathematical point of view
a direct code is the limiting case of selector coding.

The maximum number of combinations for fixed field size, H,
is obtained when Y is the largest integer equal to or less than
$\tfrac{1}{2}H$:

$$C_{\max} = \frac{H!}{Y!\,(H-Y)!}, \qquad Y = \lfloor \tfrac{1}{2}H \rfloor \tag{4.2}$$

Thus, with a fixed number of spots in a field of size eight
the maximum number of combinations is 70:

$$C = \frac{8!}{4!\,(8-4)!} = 70.$$



In order to evaluate the usefulness of selector codes it must be 
kept in mind that they permit only one concept to be coded in each 
field. Hence, selector codes are principally useful for coding a
single member of a group of mutually exclusive concepts (disjoint
classes) of which the date of publication and the name of the
publisher are good examples. Selector codes have found their greatest 
use in edge punched cards where the sorting is accomplished by running 
needles through the holes of the field in question. Since fewer holes 
are required (than in direct coding) for the same number of coding 
possibilities, selector coded cards can in general be arranged in 
sequence with less effort. 
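
The counts quoted in this section can be reproduced by listing the selector codes outright. In the sketch below the ten publisher names are placeholders, and `selector_codes` and `best_Y` are our own helper names.

```python
from itertools import combinations
from math import comb

def selector_codes(H, Y):
    """All fields of H bits with exactly Y ones, as tuples of 0/1."""
    return [tuple(1 if i in ones else 0 for i in range(H))
            for ones in combinations(range(H), Y)]

def best_Y(H):
    """Value of Y giving the most combinations for a field of size H."""
    return H // 2

codes_2_of_5 = selector_codes(5, 2)

# Assign each of ten hypothetical publishers its own 2-of-5 pattern.
publishers = [f"publisher-{i}" for i in range(10)]
code_of = dict(zip(publishers, codes_2_of_5))
```

Each field holds exactly one pattern, so a record can name one publisher in the field but never two, which is why the scheme suits mutually exclusive classes.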
4. Sequence Coding

Hinimum effort in sorting records into a predetermined sequence 
is the chief advantage of sequence codes > Sequence codes, like selector 
codes, are based on the principle of attributing meaning to a combination 
of bits in a fixed field. But in this case the number of bits to be 
used is not fixed. Sequence codes are based on taking all possible 
combinations by letting Y vary from zero to H. They make available for 
subject-matter analysis a number of concepts equal to the sum (C^) of 
a series of selector codes (c). 

Y=:H 

^.Th-TT*.' 



ih.3) 






Y=0 

A sequence code offers many more coding combinations than a
selector code. Thus, in a size eight field, we have $2^8 = 256$
possibilities versus 70 for selector coding. However, it is still true
that only one combination of the sequence code in question can be
punched in any individual field.
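As a small check on the comparison just made, the sketch below sums the selector-code counts over every spot count from zero to H. The function name is an assumption made for illustration.

```python
from math import comb

def sequence_combinations(h: int) -> int:
    """Sum of the selector-code counts for every spot count Y = 0..h."""
    return sum(comb(h, y) for y in range(h + 1))

print(sequence_combinations(8))   # 256
print(comb(8, 4))                 # 70, the best single selector code
```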

It may be shown that searching for any one combination punched
in a sequence-coded field of size H will require that all H positions
be examined. On the other hand, in a selector coded field, an item
is identified precisely once the Y ones have been located, and only
H-1 positions need ever be examined.
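The H-1 bound for selector codes can be verified by brute force. The sketch below assumes a left-to-right scan that stops as soon as the rest of the field is forced, either because all Y ones have been seen or because every remaining position must be a one. The scan order and the function names are assumptions, not part of the original description.

```python
from itertools import combinations

def positions_needed(code: tuple, y: int) -> int:
    """Positions read, left to right, before a selector code with y ones is determined."""
    h = len(code)
    ones_seen = 0
    for k in range(h):
        remaining = h - k
        ones_left = y - ones_seen
        # The rest is forced if no ones remain, or if every remaining position is a one.
        if ones_left == 0 or ones_left == remaining:
            return k
        ones_seen += code[k]
    return h  # only reached if the code does not contain exactly y ones

def worst_case(h: int, y: int) -> int:
    """Largest number of positions read over all selector codes of the field."""
    worst = 0
    for ones in combinations(range(h), y):
        code = tuple(1 if i in ones else 0 for i in range(h))
        worst = max(worst, positions_needed(code, y))
    return worst

print(worst_case(8, 4))
print(worst_case(5, 2))
```

A sequence code gives no such shortcut, since any of the H positions may independently be zero or one.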

5. Coding Efficiency 

Up to now we have considered only one coding field. In practice,
of course, combinations of fields are used, and an individual record
may be divided into a number of fields. However, with the types of
coding discussed so far it is not possible to make more than one
entry in a given field. Hence, we are limited to information which
may be analyzed in terms of mutually exclusive concepts. We will
now direct our attention to other coding methods not subject to the
above mentioned restrictions.

First it is necessary to introduce the idea of efficiency of
codification.

To determine how many symbol positions are required in an
unambiguous codification using K different symbols, we consider each
symbol as a number in the base K and use enough positions to count the
number of items in the base K. That is, with N items or concepts and
K symbols, the number of symbol positions in a codification must be
no smaller than the smallest integer n such that

(4.4)   $K^n \ge N$   (or $n \ge \log_K N$)

if the codification relating to an item is to be unique.

For the binary case which we have discussed so far, K = 2, and for
600 concepts, say, we have:

(4.5)   $n \ge \log_2 600 = \dfrac{\log 600}{\log 2} = 9.229$, or n = 10.

If more than n positions are actually used, the code is said to be
redundant; if fewer than n, irredundant; and if exactly n, efficient.
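A minimal sketch of condition (4.4) follows. It uses integer arithmetic rather than logarithms so that no rounding question arises; the function name is an assumption made for illustration.

```python
def positions_required(n_items: int, k_symbols: int) -> int:
    """Smallest n such that k_symbols ** n >= n_items, as in (4.4)."""
    if n_items < 1 or k_symbols < 2:
        raise ValueError("need at least one item and at least two symbols")
    n, capacity = 0, 1
    while capacity < n_items:
        capacity *= k_symbols
        n += 1
    return n

print(positions_required(600, 2))    # 10 binary positions for 600 concepts
print(positions_required(600, 10))   # 3 decimal positions for the same 600
```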

It is possible to utilize a purposely redundant code in order to detect
and automatically correct errors in the recorded information introduced
during storage or transmission. [13] However, the self-correcting
scheme that makes use of redundant positions must also be able to
correct errors that may occur in the extra redundant positions as
well. This self-correcting feature requires considerable additional
operations to be employed by the data processing system. Hence,
its use may be uneconomical from the point of view of operating
time and equipment as well as the additional storage space required.
The amount of redundancy necessary depends on the depth of error
we desire to correct. It may be obtained by combining equations
(4.3) and (4.4) to obtain:

(4.6)   $R = \log_2 \sum_{Y=0}^{M} \dfrac{H!}{Y!\,(H-Y)!}$

where M is the level of error correctable in a field of size H
by the use of R redundant positions. Then the effective number
of positions left for actual use in coding is H-R.

6. Superimposed Coding 

On the other hand, if we have available a field of H bit
positions, then there is, as noted above, room for $N = 2^H$ concepts
to be coded efficiently. Suppose, however, we know that fewer than
N of these actually appear in a particular situation. Furthermore,
suppose that we are dealing with items each of which is associated
with up to X of these concepts; then at least $X \log_2 N$
bit positions must be used to identify an item, or X fields of
the size H available. But we are here considering situations in
which they do not all actually appear. The question arises: How
can we use fewer than $X \log_2 N$ bits for coding, say H, and still
distinguish among all N concepts?

By the superimposed code corresponding to X concepts, we mean 
the result of forming the logical sum of the individual sequence or 
selector codes for these X concepts. 

This procedure permits the use of one field of size H in place
of X such fields to code one item. However, we must evidently pay
a price for this convenience and saving in item coding positions.
For if items are to be selected from the file on the basis of fewer
than X concepts, then it is possible for an item to be selected
because the logical sum of the codes for the desired characteristics
has units in positions which are among those in the superimposed
code associated with a different combination of concepts.
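The false selection just described can be made concrete with a short sketch. The 8-bit codes below are hypothetical and chosen only so that one concept overlaps the superimposed code of two others; the function names are likewise assumptions.

```python
def superimpose(codes) -> int:
    """Logical sum (bitwise OR) of the individual concept codes."""
    field = 0
    for code in codes:
        field |= code
    return field

def is_selected(item_field: int, query_codes) -> bool:
    """An item is selected when every unit of the query appears in its field."""
    query = superimpose(query_codes)
    return item_field & query == query

# Hypothetical selector codes, two units each, in a field of eight.
A = 0b11000000
B = 0b00110000
C = 0b10010000   # one unit in common with A, one in common with B
D = 0b00000011

item = superimpose([A, B])        # item indexed under concepts A and B only

print(is_selected(item, [A]))     # True: a correct selection
print(is_selected(item, [C]))     # True: a false selection, C was never coded
print(is_selected(item, [D]))     # False
```

Because the test only asks whether the query's units are present, an item that really carries a concept can never fail it; the only error possible is the extra selection shown for C.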

Appendix B contains an illustrative example of superimposed coding
and also several formulas for the probability, p, that an item not
associated with a particular characteristic will be selected during
the search for that characteristic. The decision to use superimposed
coding depends on the relative importance in a particular application
of saving space versus the number of false selections that can be
tolerated. Note, however, that no item having the desired
characteristics will be missed in a search merely because this
technique was used in coding it.
A MAZE MODEL FOR INFORMATION RETRIEVAL

1. Hierarchical Classification as a Maze

Another possible model for information retrieval systems is that
of an abstract maze. It is an analogy which may appear particularly
apt to most library users. This interpretation is especially
helpful in shedding light on the particularly difficult case of
hierarchical classification with cross references. A portion of a
subject catalog system linked by cross references is shown in outline
form in Figure 5.1. The same area of the catalog is characterized
graphically in Figure 5.2. The subject catalog achieves its maze
attributes as a result of the sets of cross references linking the
subject headings. Although it was developed originally as a search
aid, the difficulty of keeping in mind much more than point to point
search properties is a limitation to its usefulness.

Figure 5.1 Library of Congress Classification Scheme

2. Search Strategy

Given an initial subject heading related to the search
requirements, the searcher would most certainly be helped in forming
a strategy of search if he were able to study the character of the
subject heading maze in a region around the point of entry.
In a machine search, this would appear to require pattern-recognizing
devices such as the perceptron, or special techniques of topological
classification, [16, 17] which are currently under development.
However, there is a simple algorithm usable by man or machine which
can be used to reorient a maze with respect to any arbitrary point
of entry.

3. Maze Reorientation

This algorithm is essentially the first part of one developed
by E. F. Moore for finding the shortest path through a maze.
This reorientation proceeds as follows: Select the origin point
and tag it with a zero. Select all points which have not been tagged,
but which are adjacent to the points which have been tagged, and
provide them with the tag i (i = 1, 2, ..., k). Repeat this process
until the k'th level points are tagged, where k is the depth to which
it is desired to penetrate the maze on this pass.
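The tagging procedure above is a breadth-first traversal cut off at depth k. A minimal sketch follows; the small catalog graph and its subject headings are hypothetical, and the adjacency-list representation is an assumption rather than anything specified in the text.

```python
from collections import deque

def reorient(maze: dict, origin: str, depth: int) -> dict:
    """Tag each heading with its distance from the origin, out to `depth` steps."""
    tags = {origin: 0}
    frontier = deque([origin])
    while frontier:
        point = frontier.popleft()
        if tags[point] == depth:
            continue                      # do not penetrate beyond the k'th level
        for neighbor in maze.get(point, ()):
            if neighbor not in tags:      # only untagged adjacent points are tagged
                tags[neighbor] = tags[point] + 1
                frontier.append(neighbor)
    return tags

# Hypothetical subject headings linked by cross references.
maze = {
    "Satellites": ["Navigation", "Orbits"],
    "Navigation": ["Satellites", "Doppler effect"],
    "Orbits": ["Satellites", "Doppler effect"],
    "Doppler effect": ["Navigation", "Orbits", "Radar"],
    "Radar": ["Doppler effect"],
}

print(reorient(maze, "Satellites", 2))
```

Calling the function again with a different origin gives the same maze oriented about the new point of entry, which is all that a repeated pass requires.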

Figure 5.3 displays a graph of the same catalog maze oriented
with respect to a particular point of entry. The most practical use
of such a procedure appears to be in a retrieval system allowing
man-machine "cross-talk." The user could specify the point of entry
or it could be made random. The machine would orient the catalog
maze to his point of entry out to (say) k steps. It would then
organize a search strategy according to the incidence of terms
presented by the user to describe his goal. The search might be
continued until a specified number of documents had dropped out.
(These may be thought of as "dead end" lines in the graph.) The
machine could repeat the reorienting algorithm as often as necessary,
beginning anew at the one of the kth intersections characterized by
the greatest number of the descriptive terms given originally. At
any time, a reoriented graph might be produced for the user's perusal
and decision concerning a new point of entry, or broadening or
narrowing of the search prescription.

The suggested retrieval system, while requiring rather
sophisticated data processing machinery, would permit a sort of
browsing which is greatly desired by most users. [1] Its employment
in a particular case would depend mainly on cost considerations.

In the case of a strictly hierarchical classification this
technique has obvious simplifications.

4. An Empirical Classification System

The coordinate indexing system may also be described by a maze,
but there is little advantage gained. However, there might be
superimposed on the coordinate index some sort of statistically
developed association trails [2] between the various descriptors. These
would indicate the relative frequency with which a given combination
of terms appears together in the record of a document. In such an
empirically generated catalog system, all connections would be cross
references, obtained not a priori from the descriptive terms but a
posteriori from the documents described by them. The highly complex
maze thus obtained would be susceptible to the searching procedure
outlined above.
PHILOSOPHY AND PRELIMINARY CONSIDERATIONS

1. General Criteria

The first part of this paper has discussed the theory of
information retrieval and suggested several mathematical models in terms
of which this theory may be organized and described. The second part
will be concerned with an experimental information retrieval system
designed and put into actual operation as a practical example (and
possible critique) of these theories. However, the design of such
a system, while founded on the mathematical models, is as much an
exercise in systems engineering as in mathematics. With this in
mind, the following criteria are proposed for use in the comparison
of existing practical systems and also as constraints to be
met in the synthesis of new information processing systems:

1. Size of the file to be covered

2. Rate of growth of the file and system

3. Range of inquiries to be serviced, or the purposes to
be served

4. Range of subject matter to be covered

5. Kinds of concepts to be represented

6. Specificity and type of analysis

7. Personnel required to do the analysis

8. Cost of processing information and conducting searches

9. Reliability of results, or probability of retrieval

10. Form of system
2. Specific Constraints

In the particular case of this example, further very practical
considerations of economy, time, and human engineering imposed
additional constraints. Of the existing library contents, classification
and cataloging system, and clerical procedures, only the latter might
be changed at all, and that as little as possible. The system could
only be mechanized using the presently available equipment, and that
only on a time-sharing basis with many other projects of higher
priority. On the other hand, the fact that no new equipment was to
be obtained specifically for the system permitted the use of trial
and error as a legitimate improvement technique in the synthesis.
Moreover, the cooperation of those whose daily employment would be
most directly affected by the system was outstanding, in marked
contrast to nearly all reports on the introduction of automation into
industry. [26, 27]

The technical reports section of the U. S. Naval Postgraduate
School Library contains about 150,000 items and is growing at a rate
of about 5,000 per year. The items vary in size from thin pamphlets
to folios. They are stored in file cabinets, or are shelved individ- 
ually or in boxes. About 60,000 have a security classification of 
confidential or higher and must be stored in a locked or guarded 
area. The reports are published by government agencies or by private 
institutions under government contract. They are of a scientific, 
engineering, or technical-operational nature. 
On receipt the reports are assigned an accession or shelf number
which locates them in storage but conveys no other information. They
are cataloged as to source, report number (if any) assigned by the
source, title, author(s), and date of publication. A sample of the
catalog card is shown in figure 1.1.

Figure 1.1
Sample Catalog Card

U-46,301

Institute of Flight Structures, Columbia University
TN 1

Vibrations and stability of plates under initial
stress, by G. Herrmann and A. Armenakas. February 1959

The only cross filing is by source. At the same time, the report 
is classified by being assigned any number of descriptors or uni- 
terms each describing some phase of its contents, and its shelf 
number is entered on the card corresponding to each of these 
descriptors in a coordinate index file. The list of descriptors 
is open-ended in that new ones may be invented as the cataloger 
feels necessary. The descriptors may be words or word roots em- 
bodying general concepts; they may be technical terms; or even 
names of projects and weapons systems. 

The reports are of an essentially transitory usefulness, which,
however, may be measured in terms of months or years. Nevertheless,
regular weeding of material obsolete in the judgment of the
catalogers is considered a virtually impossible task.

The technical reports library is used primarily by graduate
students in engineering who are writing research papers or performing
laboratory projects in connection with established courses.
A typical informal inquiry by one of them might be: "What do you have
on plasma propulsion systems?" The library is also used by staff
and faculty members who are more likely to have in mind a specific
source or author; furthermore their advice to students as to where
to obtain information on a particular subject usually takes this
same form. In short, the inquiries tend to be either extremely
general or extremely specific.

The equipment available with which to realize an automatic
system consisted of:

CDC 1604 high speed digital computer [28]

CDC 1607 magnetic tape input-output equipment [29]

IBM card punch [30]

IBM card to paper tape translator [30]

IBM 717/757 magnetic tape controlled printer [30]

IBM 402 accounting machine [31]

Friden Flexowriters [31]

Remington-Rand Synchro-tape typewriters [32]

One of each of the last two items was available in the library
itself. The rest were available on a time-sharing basis in the
USNPS Computer Center.
DESIGN OF SABIRS

1. Logical Design

The logical design of this Semi-Automatic Bibliographic
Information Retrieval System (hereafter known as SABIRS) is based, in
general, on the lattice model described in Part I. Each document in
the library is represented by a record which is the logical product
or intersection of the distinguished sets to which the document
belongs under three different methods of classification: (a) by source,
(b) by date of publication, and (c) by uniterms.

a is the set of all documents in the library originated
by that particular agency or company.

b is the set of all documents in the library published
during that particular month.

c is itself the intersection of n sets (n = 1 through 12),
each one of which consists of all the documents in the
library having to do with a particular Uniterm.

In addition, each record carries as a tag the shelf number of the
document it represents. As an example, the document whose catalog
card is shown in figure 1.1 would have the following record:

Source: Columbia University

Date: February, 1959

Uniterms: Elasticity, Stress, Plates, Motion, Stability.

Shelf Number: U-46,301
The raw requests for information by clients are idealized by being
put into a standard form, an example of which is shown in figure 2.1.

Figure 2.1
Sample Request Form

NAME: W. T. Door

TELEPHONE: Ex. 988

ROUTING OR BOX #: 1439

SOURCE(S):            N.A.S.A.                00102064
                      Lockheed Aircraft Co.   00100732
                      Applied Physics Lab.    00100037

DATE OF PUBLICATION:  From March, 1960        DATE6003
                      To present              THRU9999

UNITERMS:             Satellite               00006437
                      Navigation              00003207
                      Doppler                 00000462

INSTRUCTIONS: If "any" source and/or date and/or subject is desired,
leave corresponding code blank. No more than 15 total code-words may
be entered. No more than 12 of them may be uniterms.
The only additional constraint on the formal request besides those
described in Part I, Section 3, is that no disjunction of uniterms
is permitted. Every such disjunction attempted requires the
initiation of an additional request.

Each record on file has the following Boolean algebraic form:
$s \cdot d \cdot u$ or, broken down to individual uniterms,

(2.1)   $s \cdot d \cdot \prod_{i=1}^{n} t_i$,   $n \le 12$,

where s denotes source, d denotes date, and t denotes Uniterm.

Each admissible ideal request has the following Boolean
algebraic form:

(2.2)   $\Bigl(\sum_{j=1}^{m} s_j\Bigr) \cdot \Bigl(\sum_{k=j} d_k\Bigr) \cdot \prod_{i=1}^{n} t_i$,   $n \le 12$.

The retrieval system selects a set of documents from among those
in the library. This set has a finite (possibly zero) number of
members. That set represented by the request is denumerably
infinite. The correspondence is the homomorphism described in Part I,
Section 3.
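The selection of records by a request can be sketched as follows, reading the request form as: any one of the listed sources, a date within the stated range, and every listed uniterm. The class and field names are hypothetical, the records are invented for illustration, and treating both ends of the date range as inclusive is an assumption. The code words are those shown on the sample request form.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    shelf: str
    source: str
    date: str                  # year and month as "yymm"
    uniterms: frozenset

@dataclass(frozen=True)
class Request:
    sources: frozenset = frozenset()   # empty means "any" source
    date_from: str = "0000"
    date_thru: str = "9999"
    uniterms: frozenset = frozenset()  # empty means "any" subject

def satisfies(record: Record, request: Request) -> bool:
    """Disjunction over sources, a date range, and a conjunction of uniterms."""
    source_ok = not request.sources or record.source in request.sources
    date_ok = request.date_from <= record.date <= request.date_thru
    terms_ok = request.uniterms <= record.uniterms   # subset test: all must appear
    return source_ok and date_ok and terms_ok

# Invented records (shelf numbers and contents are hypothetical).
file_of_records = [
    Record("U0000001", "00102064", "6005", frozenset({"00006437", "00003207", "00000462"})),
    Record("U0000002", "00102064", "5912", frozenset({"00006437", "00003207", "00000462"})),
    Record("U0000003", "00100732", "6101", frozenset({"00006437", "00003207"})),
]

request = Request(
    sources=frozenset({"00102064", "00100732", "00100037"}),
    date_from="6003",
    date_thru="9999",
    uniterms=frozenset({"00006437", "00003207", "00000462"}),
)

bibliography = [r.shelf for r in file_of_records if satisfies(r, request)]
print(bibliography)
```

A disjunction of uniterms would have to be expressed as two separate requests, in keeping with the constraint stated above.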

2. Functional Design 

The functional design of SABIRS was guided primarily by the
general philosophy and particular constraints described in Section 1.
The three types of classification chosen were selected on the
basis of past experience by the staff of the Technical Reports
Library. It was considered highly desirable for a human operator
to be able to distinguish at a glance a source name from a Uniterm
in their coded form. Because of the rapid increase of publishing
agencies, weapons system designators, etc., it was believed necessary
to allow for at least 100,000 possible sources and uniterms. [5, 33]
It was convenient to utilize IBM binary coded decimal characters and
to limit the length of a coded record or request to 120 such
characters to allow direct use of the IBM 717 Line Printer.
Furthermore, it was extremely convenient, because of the 48 bit word
length of the CDC 1604 Computer, to make each code eight characters.
Because of the high speed and large memory of the 1604, more
economical coding was not considered necessary. Because of the
semi-automatic nature of the system, with its frequent use of human
intervention, easily decodable forms were considered highly
desirable.

The present coded record of 15 eight-character words could
easily be compressed if necessary; however it permits relatively
systematic expansion of the capabilities of the system when the
need arises. At present, over 75,000 records can be stored on one
reel of magnetic tape. It is believed that future developments will
be directed toward the storage of more information per record
rather than toward physically shorter records.

Actual operating time on the CDC 1604 computer is held to
a minimum by off-line preparation of input data on the Remington-
Rand Synchro-Tape Machine and by off-line print out of the results
on the IBM Line-Printer. It is estimated that actual computer
time for a daily run with a library of up to 150,000 records will be
under five minutes. This includes the updating of the file of
records, which is accomplished at the same time as the search for
information.

The master program for the 1604 phase of the system operation
is a program generator type of data processing compiler rather than
a formula interpreter. This compiler is, of course, highly
specialized but completely self-contained. It is believed that
most future requirements will be able to be filled by additions to
the executive program utilizing subroutines already available.

Considerable care has been taken to make the operation of the 
system straightforward and simple. The operating personnel are 
library staff members for whom some additional training is, of course, 
necessary; this consists mainly of enough practice, under instruction, 
to become facile in the operation of the machines. A wide leeway 
in input format is permitted before an error occurs in the output. 
There are a number of signals to indicate possible errors and, in 
addition, the input as read by the computer is printed as an output 
along with the results for cross checking. The following chapter 
contains a detailed description of the system operation, error signals, 
and particular capabilities and limitations of the system. 
OPERATION OF SABIRS

1. Standard Operating Procedure

The routine operation of the Semi-Automatic Bibliographic
Retrieval System is outlined in the flow charts of figure 3.1
(A through C). These should be followed through before reading the
additional details listed below.

Starting the Main Computer Program

The search generator or compiler program for the Control Data
Corporation 1604 Computer is stored on a reel of magnetic tape in
assembly routine "AR" format. It must be entered into the computer
memory from the tape unit designated four utilizing current operating
procedures with the Computer Center's standard library of subroutines.
With the compiler in memory and the most recent file of records on
tape unit two, the reel of punched paper tape should be placed in
the Ferranti reader and the latter set to character mode.
After raising the start switch, the program will normally continue
to completion without further intervention by the operator. Error
signals which may appear on the console typewriter are discussed in
the next section. Upon completion, the computer stops at address 06037.

Interpreting the Output

The output from the computer appears on magnetic tape as shown
in figure 3.1B. It may be printed as desired on the IBM Line-Printer.
Examples of typical outputs and the inputs which generated them are
contained in Appendix C. The bibliography appears as a list of shelf
numbers headed by the identifying name or number of the request which
they fill. This list is printed across the page with the tag name on
the left. Every time this tag appears the shelf numbers to the right
represent additions to the bibliography generated by that request.
Following all the bibliographies, the input requests, as read by the
computer, are printed to facilitate the detection of errors and to
permit the carrying along of clerical information (see Appendix C).
Next are printed the new records just as they were added to the
file. Thirdly, the list of shelf numbers which were deleted from
the file is printed. Each line of this list is begun by the title
"to delete".

Figure 3.1A Operational Flow Chart for SABIRS

Figure 3.1B Operational Flow Chart for SABIRS

Figure 3.1C Operational Flow Chart for SABIRS

The fourth block following the bibliographies is headed by the
title "errors" and lists an identifying tag (the contents of the
first word) for every line of data which the program failed to print
in its proper place in the output because of some technical error
(parity, line length, incorrect format, etc.). Finally, a count is
given of six items of interest in the run:

1. The number of blocks of items submitted to be
deleted; titled "to delete".

2. The number of records submitted to be added to
the file; titled "added".

3. The number of requests submitted to be filled;
titled "requests".

4. The number of errors found (in the sense of the
list in block four described above); titled "errors".

5. The actual number of records now on file at completion
of the run; titled "records".

6. The number of items actually deleted from the old
file during this run; titled "deleted".
The file of records itself may be printed on the line printer
if desired. Examples of such outputs are also shown in Appendix A.
The simplest method for making changes in a record on file is to list
its shelf number for deletion and the correct record with the same
number as an addition to the file. Both can be accomplished at the same
operating session. Note that the compiler does not maintain any
order of shelf numbers on magnetic tape.

2. Error Analysis and Correction

The compiler program may cause several error signals to be
typed on the console typewriter in the course of its operation.

Unlisted

This means that a character has appeared on the input tape
which is not included in the dictionary of meaningful symbols. A
space is substituted in the output. Operation is not halted. If
the unlisted character appeared in a shelf number of an item to be
deleted, the record will not have been deleted and should be repeated
on the next day's run. If the unlisted character appeared in a
record to be added, the record is now incorrectly stored in the file.
Therefore its number should be included as one to be deleted and
the record should be repeated correctly among those to be added on
the next day's run. If the unlisted character appeared in a
request, the bibliography produced may contain items which are not
pertinent. Hence, the request should be repeated correctly in the
next day's run.
Too Long

This means that more than 120 characters, exclusive of spaces,
have appeared in the input without a double period mark. This
particular typographical error is singled out because it may cause
invalidation of two or more requests or new records. However, the
operation is not halted. The error and its consequences will be
immediately obvious from the printed output. Corrective measures are
the same as those for the "unlisted" signal.

Not Even

This means that the number of characters between two pairs of
double period marks is not an even multiple of eight, as it should
be if correct code words only are employed. Operation is not
halted. This may not indicate an error if it occurs in a request
where additional clerical information has been added after the
coding (see Appendix C). In all other cases, a typographical error
is indicated which may invalidate the results. The same remarks
made in the case of the "unlisted" signal pertain here.
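The three non-halting signals described so far amount to simple checks on the text between double period marks. The sketch below is a simplification under stated assumptions: the dictionary of meaningful symbols is not given in the text, so the set used here (letters, digits, and the slash) is hypothetical, and the function name is illustrative.

```python
import string

# Hypothetical dictionary of meaningful symbols; the real one is not listed.
MEANINGFUL = set(string.ascii_letters + string.digits + "/")

def input_signals(body: str) -> list:
    """Signals raised by the text between two double period marks."""
    chars = [c for c in body if c != " "]        # spaces are never counted
    signals = []
    if any(c not in MEANINGFUL for c in chars):
        signals.append("unlisted")
    if len(chars) > 120:
        signals.append("too long")
    if len(chars) % 8 != 0:
        signals.append("not even")
    return signals

print(input_signals("U0000001 00100006 00005912"))   # well formed: no signals
print(input_signals("U0000001 0010000"))             # one character short
```

None of these checks stops the run; in keeping with the text, they only flag input that should be corrected and resubmitted.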

Mode

This means that the paper tape reader is not in the character
mode. Operation of the program halts at address 06065. Clear
computer, restart paper tape in character mode, and begin again.

Taperror

This signal appears in the event of any of several errors
occurring in the reading or writing of magnetic tape. It indicates that
a line of output data has been dropped. An identifying tag for the
line dropped appears in the list of "errors" in the output. (See
Section 1.)

No Date

This error signal appears not on the typewriter, but on the
magnetic tape output. It is printed in the bibliographic list
compiled in response to any request in which no specific range of dates
of publication is given. This usually does not indicate an actual
error, but it is noted because a typographical error of almost any
kind in a request which does specify date will cause the computer
program to ignore the date specification. On the other hand, the
computer will be unaware of mistakes in the other fields except for
those noted by the previous error signals.

3. Special Capabilities and Restrictions

SABIRS, as actually programmed, allows for considerable
variation in format and procedure over and above the routine
operation already described. Some of these capabilities and the rules
for using them are described in this section.

Direct File Copying

It is possible to copy from one magnetic tape reel to another
a complete or partial file of records without change and without
using any paper tape input. Place the file of records to be copied
on tape unit two and the blank reel on tape unit three. Set selective
jump key two. Start with the program address register equal to
07000. The only pertinent error signal is "taperror", which has the
same meaning as given in Section 2 of this chapter.
Format of a Record

The format of a record of a document is precisely fixed, as
the following example shows:

U005347^0010000600005912000004460000167500001676..

Each group of eight characters is a separate code. Reading from left
to right, the first eight characters are the shelf number; the second
eight characters are the code for the source of the document; the
third eight characters are the date of publication; and the last three
sets of eight characters each represent a Uniterm or descriptor
associated with the document. The double period indicates the end
of the record. Thus, in the case above:

U005347^   is the shelf number (the U indicates the
           unclassified area of the stacks).

00100006   is the source code.

00005912   is the date of publication (December, 1959).

00000446   is the code for a Uniterm.

00001675   is another Uniterm.

00001676   is another Uniterm associated with this
           document.

..         marks the end of the record (there may
           have been as many as nine more uniterms).

The order (shelf number, source code, date, uniterms) is fixed.
Furthermore, each code consists of exactly eight characters, and
zeros to the left must be typed (punched). None of the fields may be
omitted. Up to 12 uniterms may be included, for a maximum of 120
characters to a record. Each record must end with a double period,
which is not included in the character count. Spaces may be used
anywhere in a record for ease in reading the typed copy; they are
ignored by the computer and are not included in the character count.
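The record layout just described can be illustrated with a small parser. This is a sketch under assumptions, not the compiler's actual routine: the function name and the returned field names are hypothetical, the minimum of three code words is an assumption, and the shelf number U0000001 in the usage line is invented because the last character of the shelf number in the source example is not legible.

```python
def parse_record(text: str) -> dict:
    """Split a typed record into shelf number, source, date, and uniterms."""
    body = text.replace(" ", "")             # spaces are ignored by the computer
    if not body.endswith(".."):
        raise ValueError("record must end with a double period")
    body = body[:-2]                         # the end mark is not counted
    if len(body) % 8 != 0:
        raise ValueError("every code must be exactly eight characters")
    if not 24 <= len(body) <= 120:
        raise ValueError("need shelf, source, and date, and at most 12 uniterms")
    words = [body[i:i + 8] for i in range(0, len(body), 8)]
    shelf, source, date, *uniterms = words   # the order of the fields is fixed
    return {
        "shelf": shelf,
        "source": source,
        "year_month": date[-4:],             # e.g. "5912" for December, 1959
        "uniterms": uniterms,
    }

record = parse_record("U0000001 00100006 00005912 00000446 00001675 00001676 ..")
print(record)
```

Because spaces are stripped before anything else, the same record typed without spaces parses identically.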

Format of A Request 

On the other hand, the format of a request allows for considerable
variation, as may be seen from the numerous examples in Appendix C.
The first eight characters are taken as a tag which identifies the
request throughout the system's operations. After this tag, codes
representing sources and uniterms may be mingled in any order as long
as each one consists of eight characters. The coding for date in a
request may also be placed anywhere among the other codes, but it
consists of 16 characters in the fixed form datexxxxthruyyyy, where
the x's and y's represent two year-month date codes as in the record.
Conventionally, one codes "since March 1960" as "date6003thru9999" and
"before March 1960" as "date0000thru6003". After all codes are
entered, the request may be filled out to a maximum of 120 characters
(not including spaces, which are ignored) with any clerical or other
information desired. Every request must end with a double period.
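
A request can be taken apart in the same spirit. The sketch below is hypothetical: it assumes that source and uniterm codes are purely numeric and that the clerical note begins with a non-digit, which is how the samples in Appendix C look but is not stated as a rule in the text. It also assumes that both ends of a date range are inclusive. The names `parse_request` and `date_in_range` are illustrative only.

```python
import re

FIELD_WIDTH = 8
MAX_CHARS = 120
DATE_RE = re.compile(r"date(\d{4})thru(\d{4})")   # datexxxxthruyyyy


def parse_request(text):
    """Return (tag, codes, date_range, note) for one typed request."""
    body = "".join(text.split())                  # spaces are ignored
    if not body.endswith(".."):
        raise ValueError("request must end with a double period")
    body = body[:-2]
    if len(body) > MAX_CHARS:
        raise ValueError("request exceeds 120 characters")
    tag, pos = body[:FIELD_WIDTH], FIELD_WIDTH
    codes, date_range = [], None
    while pos < len(body):
        match = DATE_RE.match(body, pos)
        if match:                                  # 16-character date code
            date_range = (match.group(1), match.group(2))
            pos = match.end()
            continue
        field = body[pos:pos + FIELD_WIDTH]
        if len(field) == FIELD_WIDTH and field.isdigit():   # assumed numeric
            codes.append(field)
            pos += FIELD_WIDTH
            continue
        break                                      # the rest is clerical text
    return tag, codes, date_range, body[pos:]


def date_in_range(record_date, date_range):
    """Compare the year-month part of a record date with a request range."""
    low, high = date_range
    return low <= record_date[-4:] <= high         # assumed inclusive
```

Because the year-month codes are fixed-width digit strings, plain string comparison orders them correctly.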

Other Restrictions In Format 

A list of items to be deleted consists of their shelf numbers
(each eight characters) typed (punched) successively following the
title "to delete", with a double period after each set of up to 14
items.

The input punched paper tape may contain a date in the form
06/08/61 (June 8, 1961) followed by a double period. The deletions,
additions, and requests may be listed in order. In one run, up to
1000 items may be deleted, 5000 new records may be added, and 64
requests filled at the same time.

Use of Multiple Input Tapes 

The input may consist of many physically separate pieces of
paper tape. The breaks in input may occur anywhere, but each piece
of tape must end with a "seventh level" punch. This stops the program
until another tape is loaded in the reader and the compiling is resumed
by raising the start switch. When the last piece of tape is loaded,
and before starting the computer, selective jump switch 1 should be
raised.



Use of Multiple File Tapes 

When the file of records is of such size as to require the use
of more than one reel of magnetic tape for its storage, the following
procedure should be used:

(a) Place a completely filled reel of tape on unit two and a
blank reel on unit three.

(b) Do not set jump key two. Start the program as usual, feeding
in one or more paper tapes of input.

(c) The program will stop and wait for the magnetic tape reel to
be changed.

(d.1) If no updating is taking place, replace the file tape on
unit two with another reel of file, and restart the
program from the current pause.

(d.2) If updating of the file is being accomplished, remove
and replace both the unit two and unit three reels of
magnetic tape. Put another file reel on unit two and a
blank on three as usual. Restart from the current pause
condition.

(e) Whenever the last reel of file is about to be processed,
set jump key two before restarting.
APPENDIX A



SURVEY OF LATTICE THEORY¹

We begin with an undefined entity, S, a set (or collection) of
elements or members a, b, c, ..., whose nature is immaterial but which
may well be sets themselves.

1. Partial Order

Definition I: The set A is a subset of the set B if and only
if each member of A is also a member of B. A is called
a proper subset of B if (1) it is a subset of B and (2)
there exists a member of B which is not a member of A.

Definition II: A partially ordered set is a system consisting
of a set S and a relation $\geq$ ("greater than or equals" or
"contains") satisfying the following postulates:

P1   For all x, $x \geq x$   (Reflexivity)

P2   If $x \geq y$ and $y \geq x$, then $x = y$   (Anti-symmetry)

P3   If $x \geq y$ and $y \geq z$, then $x \geq z$   (Transitivity)

If x and y are any elements of S, we may have $x \geq y$ or
$y \geq x$ or neither.

Definition III: If every pair of elements of a partially
ordered set S are comparable (either $x \geq y$ or $y \geq x$), then
S is said to be linearly ordered or to be a chain.
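
As a concrete illustration of Definitions II and III, the postulates can be checked mechanically on a small finite set. The example chosen here (the divisors of 12, with "x contains y" read as "y divides x") and the function names are assumptions made for illustration; they do not come from the references.

```python
from itertools import product


def is_partial_order(elements, geq):
    """Check postulates P1 to P3 for the relation geq on a finite set."""
    reflexive = all(geq(x, x) for x in elements)
    antisymmetric = all(
        x == y or not (geq(x, y) and geq(y, x))
        for x, y in product(elements, repeat=2)
    )
    transitive = all(
        geq(x, z) or not (geq(x, y) and geq(y, z))
        for x, y, z in product(elements, repeat=3)
    )
    return reflexive and antisymmetric and transitive


def is_chain(elements, geq):
    """Definition III: every pair of elements is comparable."""
    return all(geq(x, y) or geq(y, x) for x, y in product(elements, repeat=2))


def contains(x, y):
    """x >= y is taken to mean that y divides x."""
    return x % y == 0


divisors_of_12 = [1, 2, 3, 4, 6, 12]
```

The divisors of 12 form a partially ordered set but not a chain, since 4 and 6 are not comparable; the powers of two 1, 2, 4, 8 do form a chain.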

¹ The material in this appendix is taken chiefly from comprehensive
works by Birkhoff [20, 21], Birkhoff and MacLane [22], Jacobson [23]
and Hermes [24]. Since each theorem and definition appears in at
least three of these references, individual citations have not
been made.
If $x \geq y$ and $x \neq y$, then we will write $x > y$, and we also
agree to use $y \leq x$ and $y < x$ as alternatives for $x \geq y$ and
$x > y$.

Definition IV: In a finite partially ordered set, we say that
x is a cover of y if $x > y$ and no u exists such that
$x > u > y$.

It is clear that, if $x > y$ in a finite partially ordered set, then
we can find a chain

$$x = u_1 > u_2 > \cdots > u_n = y$$

in which each $u_i$ covers $u_{i+1}$. Conversely the existence of such
a chain implies that $x > y$.

2. Lattices 

Definition V: An element u of a partially ordered set S
is said to be an upper bound for the subset A of S if
$u \geq a$ for every a in A. If u is an upper bound and
$u \leq v$ for any upper bound v of A, then u is a least
upper bound (l.u.b.) of A.

Definition VI: An element u of a partially ordered set S
is said to be a lower bound for the subset A of S if
$u \leq a$ for every a in A. If u is a lower bound and
$u \geq v$ for any lower bound v of A, then u is a
greatest lower bound (g.l.b.) of A.

We denote the l.u.b. of x, y by $x + y$ ("x union y", "x or y") and
the g.l.b. by $xy$ ("x intersect y", "x and y").
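
Definitions V and VI translate directly into a search over a finite partially ordered set. The following sketch is illustrative only; the divisibility ordering and the function names are assumptions chosen for the example.

```python
def least_upper_bound(elements, geq, subset):
    """Return the l.u.b. of subset within elements, or None if there is none."""
    uppers = [u for u in elements if all(geq(u, a) for a in subset)]
    least = [u for u in uppers if all(geq(v, u) for v in uppers)]
    return least[0] if least else None


def greatest_lower_bound(elements, geq, subset):
    """Return the g.l.b. of subset within elements, or None if there is none."""
    lowers = [u for u in elements if all(geq(a, u) for a in subset)]
    greatest = [u for u in lowers if all(geq(u, v) for v in lowers)]
    return greatest[0] if greatest else None


def contains(x, y):
    """x >= y is taken to mean that y divides x."""
    return x % y == 0


divisors_of_12 = [1, 2, 3, 4, 6, 12]
no_lub_example = [2, 3, 12, 18]   # 12 and 18 both bound {2, 3}; neither is least
```

Among the divisors of 12, the pair 4, 6 has l.u.b. 12 and g.l.b. 2. In the set 2, 3, 12, 18 the pair 2, 3 has upper bounds but no least one, so that set is not a lattice.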
Definition VII: A lattice is a partially ordered set in
which any two elements have a least upper bound and
a greatest lower bound.

Definition VIII: A lattice, L, is said to be closed if any
(finite or infinite) subset $A = \{a_i\}$ has a l.u.b. $\sum a_i$
and a g.l.b. $\prod a_i$.

Given a partially ordered set S, we mean by the lattice closure of S
the smallest closed lattice, $S^*$, which contains S.

Theorem I: A set L is a lattice if and only if the following
algebraic identities hold:

L1   For all x, $xx = x + x = x$

L2   $xy = yx$ and $x + y = y + x$

L3   $x(yz) = (xy)z$ and $x + (y + z) = (x + y) + z$

L4   $x(x + y) = x + xy = x$
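
The identities of Theorem I can be verified exhaustively on a small example. In this sketch (an illustration under assumptions, not part of the cited works) the g.l.b. and l.u.b. of two divisors of 12 are taken to be their greatest common divisor and least common multiple.

```python
from itertools import product
from math import gcd


def meet(x, y):
    """g.l.b. under divisibility: the greatest common divisor."""
    return gcd(x, y)


def join(x, y):
    """l.u.b. under divisibility: the least common multiple."""
    return x * y // gcd(x, y)


def satisfies_lattice_identities(elements, meet, join):
    """Check L1 to L4 for every triple of elements."""
    for x, y, z in product(elements, repeat=3):
        l1 = meet(x, x) == x == join(x, x)
        l2 = meet(x, y) == meet(y, x) and join(x, y) == join(y, x)
        l3 = (meet(x, meet(y, z)) == meet(meet(x, y), z)
              and join(x, join(y, z)) == join(join(x, y), z))
        l4 = meet(x, join(x, y)) == x == join(x, meet(x, y))
        if not (l1 and l2 and l3 and l4):
            return False
    return True


divisors_of_12 = [1, 2, 3, 4, 6, 12]
```

Replacing the join by ordinary addition breaks L1 at once, which shows that the check is not vacuous.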

Definition IX: A subset M of a lattice L is called a
sublattice (of L) if it is closed relative to the
compositions union and intersection.

It is evident that a sublattice is a lattice relative to the
induced compositions. On the other hand, a subset of a lattice may
be a lattice relative to the partial ordering $\geq$ defined in L
without being a sublattice. For example, let G be a group, let B be
the lattice of subsets of G, and L be the lattice of subgroups of G.
Then it is clear that L is contained in B, and that $H_1 \geq H_2$ has
the same significance in these two sets. On the other hand, if
$H_1$ and $H_2$ are subgroups, then $H_1 + H_2$ as defined in B is the
set sum of these groups. In general, this is not a subgroup; hence, it
differs from the $H_1 + H_2$ as defined in L as the smallest subgroup
of G containing their set sum.
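
A small case makes the subgroup example tangible. The sketch below assumes G is the integers modulo 6 under addition, a choice made here purely for illustration, and compares the set sum of two subgroups with the smallest subgroup containing it.

```python
def is_subgroup(subset, n):
    """In a finite group, containing 0 and being closed under + suffices."""
    return 0 in subset and all((a + b) % n in subset
                               for a in subset for b in subset)


def generated_subgroup(generators, n):
    """Smallest subgroup of the integers mod n containing the generators."""
    closure = {0} | set(generators)
    while True:
        enlarged = closure | {(a + b) % n for a in closure for b in closure}
        if enlarged == closure:
            return closure
        closure = enlarged


H1 = {0, 3}        # subgroup of order 2
H2 = {0, 2, 4}     # subgroup of order 3
set_sum = H1 | H2  # the union formed in the lattice of subsets
```

Here the set sum {0, 2, 3, 4} is not a subgroup (2 + 3 = 5 is missing), while the union formed in the lattice of subgroups is the whole group.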

3. Types of Lattices 

Definition X: A lattice is called modular if it satisfies
the condition

L5   If $x \geq y$, then $x(y + z) = y + xz$

Definition XI: A lattice is called strongly distributive if
it satisfies the condition

L6   $x(y + z) = xy + xz$

Theorem II: If three elements in a lattice satisfy L6, then
they satisfy its dual

L6'  $(x + y)(x + z) = x + yz$

Theorem III: In all finite lattices, there exist elements 0
and I which are universal lower and upper bounds
respectively; that is, for all x, $0 \leq x \leq I$.

Definition XII: A lattice is called complemented if it
satisfies the condition that for every x in L there
exists an $x'$ such that

$$xx' = 0, \qquad x + x' = I$$

Definition XIII: A lattice is called Boolean if it is both
strongly distributive and complemented.
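
The distinctions drawn in Definitions X to XIII can be seen on two small lattices of divisors. This is a hedged sketch: the choice of the divisors of 30 and of 12, and every function name, are assumptions introduced for the example.

```python
from itertools import product
from math import gcd


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def meet(x, y):
    return gcd(x, y)                 # g.l.b. under divisibility


def join(x, y):
    return x * y // gcd(x, y)        # l.u.b. under divisibility


def is_modular(lattice):
    """L5: if x >= y (y divides x), then x(y + z) = y + xz."""
    return all(meet(x, join(y, z)) == join(y, meet(x, z))
               for x, y, z in product(lattice, repeat=3) if x % y == 0)


def is_distributive(lattice):
    """L6: x(y + z) = xy + xz."""
    return all(meet(x, join(y, z)) == join(meet(x, y), meet(x, z))
               for x, y, z in product(lattice, repeat=3))


def is_complemented(lattice):
    zero, one = min(lattice), max(lattice)   # 1 and n play the roles of 0 and I
    return all(any(meet(x, c) == zero and join(x, c) == one for c in lattice)
               for x in lattice)


def is_boolean(lattice):
    return is_distributive(lattice) and is_complemented(lattice)
```

The divisors of 30 form a Boolean lattice. The divisors of 12 are distributive (hence modular) but not complemented, because 2 has no complement.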

4. Chain Conditions

Let a and b be two elements of a modular lattice such that
$a \geq b$. We consider now the finite descending chains

$$a = a_1 \geq a_2 \geq \cdots \geq a_n = b$$

connecting a and b.

Definition XIV: One such chain is called a refinement of a
second if its terms include all the terms of the other
chain. Two chains are said to be equivalent if their
terms can be put in one-to-one correspondence such that
corresponding intervals of the chains are isomorphic.

Theorem IV: Any two finite descending chains connecting the
elements a, b ($a \geq b$) of a modular lattice have equivalent
refinements.

Definition XV: A composition chain connecting a, b, $a > b$, is a
finite sequence

$$a = a_1 > a_2 > a_3 > \cdots > a_n = b$$

in which each $a_i$ is a cover of $a_{i+1}$.

We assume for simplicity now that L contains 0 and I, and we take
a = I, b = 0 in the foregoing.

Definition XVI: If there exists a composition chain in L
connecting I and 0, L is said to be of finite length.
The number of intervals of this chain is called the
length (dimension) of L.

Theorem V: A modular lattice with 0 and I is of finite length
if and only if the following two chain conditions hold:

Descending chain condition. There exists no
infinite properly descending chain $a > b > c > \cdots$ in L.

Ascending chain condition. There exists no
infinite properly ascending chain $a < b < c < \cdots$ in L.

Definition XVII: We say that an element a in L is (intersection
or meet) reducible if $a = b_1 b_2$ where the $b_i > a$.

Definition XVIII: We say that the representation of a as the
intersection of m elements is irredundant if the
intersection of any m-1 of them contains a properly.

Theorem VI: The number of terms in any two irredundant
representations of an element as g.l.b. of irreducible
elements is the same.

Definition XIX: An element p of a lattice with 0 is called a
point if p is a cover of 0.

Theorem VII: If L satisfies the descending chain condition,
L contains points.

Theorem VIII: If L is a complemented modular lattice that
satisfies both chain conditions, then the element I of
L is a l.u.b. of independent points. Conversely, if L
is a modular lattice with 0 and I such that I is a l.u.b.
of a finite number of points, then L is complemented
and satisfies both chain conditions.

5. Cardinal Products 

Definition XX: Let X and Y be sets, each with a partial
ordering relation $\geq$. By the cardinal product XY, we
mean the set of all pairs (x, y) (x in X, y in Y), where
$(x, y) \geq (u, v)$ means $x \geq u$ in X and $y \geq v$ in Y.
X and Y are the components of the cardinal product.

Theorem IX: The cardinal product LM of two lattices L and M
is also a lattice. Furthermore, LM will satisfy one
of the previously stated conditions $L_i$ if and only if
both L and M satisfy $L_i$.
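
The componentwise ordering of Definition XX is easy to exhibit. In the sketch below the two components are both taken to be the two-element chain 0 < 1; this choice and the function names are assumptions for illustration only.

```python
from itertools import product


def cardinal_product(X, Y):
    """All pairs (x, y) with x in X and y in Y."""
    return list(product(X, Y))


def geq_pair(p, q):
    """(x, y) >= (u, v) means x >= u in X and y >= v in Y."""
    return p[0] >= q[0] and p[1] >= q[1]


def join_pair(p, q):
    """l.u.b. taken component by component (each component is a chain)."""
    return (max(p[0], q[0]), max(p[1], q[1]))


def meet_pair(p, q):
    """g.l.b. taken component by component."""
    return (min(p[0], q[0]), min(p[1], q[1]))


chain = [0, 1]
square = cardinal_product(chain, chain)
```

The product has four elements; (1, 0) and (0, 1) are not comparable, their l.u.b. is (1, 1) and their g.l.b. is (0, 0), so the product of two chains need not be a chain although it is again a lattice.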

6. Homomorphisms of Lattices 

We shall now consider many-to-one correspondences $T: L \to M$
between lattices. The following three properties (among those
which T may possess) are of interest.

(1)   $x \geq y$ implies $T(x) \geq T(y)$

(2)   $T(xy) = T(x)T(y)$

(3)   $T(x + y) = T(x) + T(y)$

Note that T may possess all or none of these properties. Also,
(2) implies (1); (3) implies (1); but (1) does not imply either
(2) or (3).

Definition XXI: T is called isotone if it satisfies (1) but
not (2) nor (3).

T is called a meet-homomorphism if it satisfies
(2) but not (3).

T is called a join-homomorphism if it satisfies
(3) but not (2).

T is called a lattice-homomorphism if it
satisfies (1), (2), and (3).
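
Properties (1) to (3) can be tested for a given correspondence on finite lattices. The two maps used below are invented for illustration and are not taken from the references: a "rank" map from the divisors of 6 onto the chain 0 < 1 < 2, and the map sending each divisor x of 12 to the greatest common divisor of x and 2.

```python
from itertools import product
from math import gcd


def lcm(x, y):
    return x * y // gcd(x, y)


def divides_geq(x, y):
    """x >= y in a divisor lattice means that y divides x."""
    return x % y == 0


def int_geq(x, y):
    """Ordinary ordering on a chain of integers."""
    return x >= y


def properties(T, L, geq_L, meet_L, join_L, geq_M, meet_M, join_M):
    """Return (isotone, preserves_meet, preserves_join) for T on L."""
    pairs = list(product(L, repeat=2))
    isotone = all(geq_M(T(x), T(y)) for x, y in pairs if geq_L(x, y))
    keeps_meet = all(T(meet_L(x, y)) == meet_M(T(x), T(y)) for x, y in pairs)
    keeps_join = all(T(join_L(x, y)) == join_M(T(x), T(y)) for x, y in pairs)
    return isotone, keeps_meet, keeps_join


rank = {1: 0, 2: 1, 3: 1, 6: 2}.get      # divisors of 6 onto the chain 0 < 1 < 2
rank_result = properties(rank, [1, 2, 3, 6], divides_geq, gcd, lcm,
                         int_geq, min, max)

parity = lambda x: gcd(x, 2)             # divisors of 12 into themselves
parity_result = properties(parity, [1, 2, 3, 4, 6, 12], divides_geq, gcd, lcm,
                           divides_geq, gcd, lcm)
```

The rank map satisfies (1) only, since the meet of 2 and 3 is sent to 0 while the meet of their images is 1. The second map satisfies all three properties.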
Definition XXII: A subset J of elements of a lattice L is an
ideal if and only if

x in J and y in J imply $x + y$ in J, and
x in J and $v \leq x$ imply v in J.

As usual, a lattice-homomorphism of L can be used as an equivalence
relation to partition L into congruence classes of elements.

Theorem X: The antecedents of any element under any lattice-
homomorphism form a convex sublattice, or sublattice
which contains with any a, b all elements between a and b.

In complemented lattices, every congruence relation (hence every
lattice-homomorphism) is determined by the ideal of elements
congruent to 0.

Theorem XI: Any complemented lattice of finite length is
a cardinal product of "simple" complemented lattices.

Definition XXIII: The center of a partially ordered set P with
0 and I is the set of elements e in P which have one
component I and the others 0 under some direct
factorization of P as a cardinal product of other partially
ordered sets.

Theorem XII: The center of a Boolean lattice L is the
intersection of its maximal Boolean sublattices.
APPENDIX B



EXAMPLE OF SUPERIMPOSED CODING

The following example is intended to illustrate the principle
of superimposed coding.

In a 20 bit computer word let three characteristics have the
following codes:

Characteristic                      Position

        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

1st     0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  0  0  0  0  0

2nd     0  0  1  0  0  1  0  0  0  0  1  0  0  0  0  1  0  0  0  0

3rd     0  0  0  0  0  0  0  1  0  0  1  0  0  0  1  0  0  0  0  1

then the corresponding superimposed codification is

        0  1  1  0  0  1  0  1  0  0  1  1  0  0  1  1  0  0  0  1

With this example we shall first illustrate how superimposed coding
can be made to use fewer than $X \log_2 N$ bits and yet maintain the
ability to distinguish among N characteristics. Suppose that
N = 4,500. Let us code the characteristics as follows: use 20 bit
positions, and let each code contain precisely 4 units. This system
will allow us to code each characteristic uniquely, because 20
positions can be taken four at a time in

$$\binom{20}{4} = \frac{20!}{4!\,(20-4)!} = 4{,}845$$

different ways, and 4,845 > 4,500. Finally, suppose that each item is
associated with three characteristics; that is, X = 3. Then let the
coding of the item be the superimposed codings of its associated three
characteristics. Note that the coding of an item now requires only 20
bit positions, rather than the 39 considered necessary for efficient
coding; that is, $X \log_2 N = 3 \log_2 4500$, which comes to 39 when
each of the three codes is rounded up to a whole 13 bits.
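
The arithmetic of this example can be reproduced as follows. The sketch represents each code by the set of positions that hold a one, as read from the table of the three characteristics; the function names are illustrative assumptions.

```python
from math import comb

WIDTH = 20   # bit positions in the computer word

# Positions (numbered 1 to 20) holding a one in each characteristic code.
CODES = {
    "1st": {2, 6, 12, 15},
    "2nd": {3, 6, 11, 16},
    "3rd": {8, 11, 15, 20},
}


def to_bits(positions, width=WIDTH):
    """Render a set of one-positions as a string of 0s and 1s."""
    return "".join("1" if p in positions else "0" for p in range(1, width + 1))


def superimpose(codes):
    """Superimposed coding: the bitwise OR of the individual codes."""
    result = set()
    for code in codes:
        result |= code
    return result


item_code = superimpose(CODES.values())
distinct_codes = comb(20, 4)                   # ways to place 4 ones in 20 positions
bits_per_code = (4500 - 1).bit_length()        # whole bits needed to number 4,500 codes
conventional_bits = 3 * bits_per_code          # three separate codes per item
```

With these values `to_bits(item_code)` gives the superimposed word of the example, which contains nine ones, and `conventional_bits` is the 39 quoted in the text.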

Now it is clear that the item coded in the example above will
be selected if all items having the first characteristic are searched
for. Similarly for the second or third characteristic. However,
we must evidently pay a price for this convenience and saving in
item-coding positions. Suppose that the code for some other
characteristic is

        0  0  1  0  0  0  0  1  0  0  0  1  0  0  0  1  0  0  0  0

Our item is not associated with this characteristic but it will be
selected in a search for items associated with this characteristic.
The reason for this is that our codifications overlapped, allowing
for more than three combinations of four ones to give rise to the
item coding. In the example, nine ones appear, allowing for

$$\binom{9}{4} = \frac{9!}{4!\,(9-4)!} = 126$$

such possibilities, and hence 126 - 3 = 123 possible false
combinations. If not all the 4,845 possible characteristics actually
appear, the situation is alleviated somewhat.
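
The selection rule and the count of false combinations can be checked directly. In this sketch an item is selected when every one of the query code also appears in the item code. The one-positions of the "other characteristic" (3, 8, 12, 16) are a reconstruction from a damaged line of the source and should be treated as an assumption, as should the function name.

```python
from math import comb

# One-positions of the superimposed item code from the example.
ITEM_CODE = {2, 3, 6, 8, 11, 12, 15, 16, 20}

FIRST = {2, 6, 12, 15}          # a characteristic the item really has
OTHER = {3, 8, 12, 16}          # reconstructed code of an unrelated characteristic
UNRELATED = {1, 4, 5, 7}        # hypothetical code sharing no ones with the item


def is_selected(item_code, query_code):
    """An item is retrieved when the query's ones are all present in it."""
    return query_code <= item_code


combinations_of_four = comb(len(ITEM_CODE), 4)   # four ones chosen among nine
false_combinations = combinations_of_four - 3    # excluding the three true codes
```

The unrelated code `OTHER` is selected even though the item does not carry that characteristic, which is exactly the false selection the text describes.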

The use of superimposed coding therefore requires the
determination of the probability p that an item not associated with a
particular characteristic will be selected during the search for that
characteristic. The formulas given below are taken from Ledley [12]
and assume that the codes for characteristics and the association of
items with characteristics are random.
Let H = number of positions in superimposed coding.

    Y = number of ones in a characteristic code.

    X = number of characteristics associated with an item.

[The formulas for p are illegible in the source. Only fragments
survive: a summation limit i = Y - 1 and a term $\binom{H}{Y} - 1$.]

For the example here, H = 20, Y = 4, X = 3; hence [the resulting
value is illegible in the source].
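
As an independent check on the idea, and explicitly not a reconstruction of Ledley's formulas, the probability that a randomly chosen characteristic code is entirely covered by the superimposed code of an item can be computed by inclusion and exclusion. The sketch assumes that the X codes of the item and the query code are each drawn uniformly and independently from all codes with Y ones in H positions; it therefore also counts the rare case in which the query coincides with one of the item's own codes. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def coverage_probability(H, Y, X):
    """P(all Y ones of a random query lie within the OR of X random codes)."""
    total = comb(H, Y)
    # A random code misses j given positions with probability C(H-j, Y) / C(H, Y);
    # inclusion and exclusion over the query's Y positions gives the result.
    return sum((-1) ** j * comb(Y, j) * Fraction(comb(H - j, Y), total) ** X
               for j in range(Y + 1))


def brute_force(H, Y, X):
    """Exhaustive count of the same probability, practical only for tiny cases."""
    codes = [frozenset(c) for c in combinations(range(H), Y)]
    query = codes[0]                     # by symmetry any fixed query will do
    hits = sum(1 for choice in product(codes, repeat=X)
               if query <= frozenset().union(*choice))
    return Fraction(hits, len(codes) ** X)
```

For the parameters of the example one would evaluate `coverage_probability(20, 4, 3)`; the brute-force routine exists only to confirm the formula on small cases.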
APPENDIX C



EXAMPLES OF PRODUCTION DATA

This appendix contains examples of some typical data processed
by SABIRS I, the first version of a semi-automatic bibliographic
information retrieval system actually installed and operating on a
limited basis at the Technical Reports Library of the U. S. Naval
Postgraduate School.

Figure C.1 is a portion of a typical file of records as
recorded on magnetic tape for use of the system. An actual file may
contain up to 75,000 records on one reel of tape.

Figure C.2 is an example of input data; it contains the
following:

(a) Four records to be added to the file. Note that two
of these have the same shelf number, date, and source,
but different uniterms. This demonstrates the handling
of a record whose analysis gives rise to more than the
twelve uniterms permitted in each record.

(b) Two lists of items to be deleted for a total of four
deletions. Note the clerical information appearing
here and after the requests. Any such notes may be
entered after the coded data up to a total of 120
characters per block of information.

(c) Seven requests for bibliographic information. Note
that the identifying tags used (in this case the names
of the requestors) must be exactly eight characters
long not counting spaces.

Figure C.3 is the information output from this input data.
There are seven blocks of data separated by blank lines.

(a) Bibliographies requested. Note that each time a
particular tag appears, that line consists of more documents
to be included in the particular bibliography.

(b) Copy of the requests as received.

(c) Copy of the new records to be added, printed as received
by the computer.

(d) Copy of the lists of items to be deleted, as received.

(e) List of errors. (See Chapter III, section 2,
Taperror.)

(f) Accounting data and date.

Figure C.4 shows the same portion of a file as in Figure C.1
after updating by the additions and deletions appearing in Figures C.2
and C.3.
Figure C.1 Portion of File of Records



to delete uOO^JlhJ u0057106 superceded by final reports..

U0057036 00100126 00005911 00006444 00002104 00002017 00001715
         00001640 00003403 00001615 00001503
         00001670 00002116 00003157 00001623..

U0057036 00100126 00005911 00003150 00000774 00004040 00005730
         00002306 00000340..

McCalla/ 00000055 date 6006 thru 9999 for Lt. T.R. McCalla, box 1677..
Jauregui 00100043 00002047 for Lt. S. Jauregui, box 329..

05/19/61..

U0057109 00100167 00006005 00002021 00001727 00001706 00002004
         00001743 00001203 00004324 00004710
         00005457..
Figure C.2 Sample Input Data Containing Deletions, Additional Records,
and Requests for Information.

Figure C.3 Output Resulting from Input Shown in Figure C.2